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Objectives and Scope 



Course Goals 

Introduction to EXL DA Methodology 

Provide a structured overview of concepts relating to preliminary data exploration and data preparation as 
applied under EXL DA methodology 

Explain various alternative techniques of outlier treatment and missing value imputation 
Provide helpful “tricks of the trade” 


Beyond the Scope of this Training 

Comprehensive coaching on Data Analysis and Preparation 
Technique-specific algorithms (unless required as part of methodology explanation) 


Self Study Goals 

In-depth research on SAS implementation of explained techniques 
Innovations and new techniques related to methodology 
Discussion on advanced concepts can be taken up offline 
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EXL Decision Analytics Methodology Snapshot 



We apply a set of highly effective tools, techniques and best practices for the end-to-end model development cycle 


Stage 1 


Stage 2 


Stage 3 


Stage 4 


Preliminary Data 
Exploration 



Variable Creation 


Variable Reduction 


Univariate Analysis (EDD*) 
Modeling and Validation Split 
Bivariate Analysis 

Outlier Treatment 
Missing Imputation 
Roll Ups and Data Merge 


These stages 
demand lot of manual 
effort in analyzing 
and understanding 
each and every 
variable 


Dummy Variable Creation 
Binning and Banding 
Transformations 
Interactions and Groupings 

Variable Clustering 

Inter-Correlation Analysis 

Variance Inflation Factor Test 


These stages require 
business sense and 
out-of-box thinking 
for brainstorming on 
creating hypothesis- 
based variables and 
dropping redundant 
features 


Stage 5 Modeling 


Stage 6 


Validation and 
Stabilization 
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Modeling Technique Selection 

Model Improvements 

Ensemble 


In-Sample Validation 

Out-of-Time Validation 

Bootstrapping 

Coefficient Blasting 

* Extended Data Dictionary 



These stages require 
good knowledge of 
statistical techniques 
for providing high- 
end quality solutions 
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Chapter 1: Preliminary Data Exploration 

Data Understanding and Prerequisites for Data Preparation 
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1.1 Data Collection Process 


5. Quality Check 


1. Identify Data Needs 


2. Data Mapping 

Interview clients 



A B C D E F G 

1 

2 

3 

Study data layouts 



3. Plan Data Request 


4. Send Data Request 

Xexl 


Do data sanity checks and 
preliminary data analysis 



Keep in mind population to 
be covered and alternative 
data sources 


List the data requirements 
clearly and precisely 


Understand problem 
statement 



Check data samples 
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1.2 Data Dictionary 

A comprehensive data dictionary should be maintained and updated as and when any new information is gathered. 

USE: It can go a long way in helping us understand the data better. For instance, it can help us to revisit old information and see what our 
initial hypothesis was and how it is changing with the new updated information. 


THINGS TO INCLUDE IN THE DATA DICTIONARY: 


■ Meaning of all Potential Predictors: 

Maintain labels of as many variables as possible 

If possible, one should also try to capture the business sense of these variables 

Wherever things are not clear, it should be noted down so that it can be clarified with the client later on 

■ Clear Definition of Unique Identifier and its Meaning: 

Ascertain the level at which data is to be rolled up / down. For instance, 

o Individual level 

o Individual x Account level 

Individual x Month level 
o Individual x Account x Month level, etc. 

Identify unique key of every dataset. Few examples below: 

Payment data may be at transaction level 

o Demographic data at individual level 

Census data at zip code level 


■ Dependent Variable Definition and Meaning: This is a very crucial step in modeling exercise as wrong definition can lead to completely wrong 
conclusions. In absence of a clear definition at this stage, it may be defined later after some actual data analysis. 


■ Variable Classification: If not already given, one should always try and classify the variables like 
Demographic variables, e.g. age, gender 
Performance variables, e.g. spend, number of transactions 
Credit Attributes, e.g. total credit line, FICO score 

Census level, e.g. population, location attributes such as income levels 

8 | June 30, 2015 | © 2015 ExIService Holdings, Inc. 


Xexl 



1.3 Modeling and Validation Split 


To start the modeling process, there is a need to create modeling and validation datasets. 
Validation dataset helps validate the performance of the model which is built using the modeling 
dataset. A poor performance on validation dataset would imply that the model is not robust. 



Diagram Scale: Based on 60:40 split assumption 


Master / Full 
Training Dataset 



Modeling / Training 
Dataset 


Out-of-Sample 

Validation 


Test Dataset 



Out-of-Time 

Validation 


Data Preparation stage helps us create the 
master dataset for modeling and validation. 
Note that apart from out-of-sample validation, a 
fully independent out-of-time validation sample 
is also necessary to test the robustness of the 
model. 


Step 1 : Before we start the modeling process, we need to define 
and create the modeling population. From the data that is 
shared by the client, depending upon the scope of the analysis, 
an assessment of the required data (a certain amount of history, 
a certain length of future for prediction, quality of data, etc.), list 
down the defining criterion for eligible population. 

Step 2: Split the final eligible population into parts - modeling 
dataset (also called training dataset) and validation datasets. 
This can be done using 

a random assessment (60:40 split or 80:20 split); or 
specific splitting criterion (based on time/segments) 

It is important to validate our model for performance on data 
which was not used to build the model, but is the expected data 
that will be encountered in live environment - and hence, the 
need for validation dataset. 
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1.4 Univariate Analysis through EDD 

The ;DD, or Extended Data Dictionary macro produces a summary of the variables present in a 
dataset. It is a comprehensive and complete view of all variables with the following information being 
present. 

Number of observations present (numobs) 

Number of observations missing (nmiss) 

Number of unique values for a variable 

Mean, standard deviation, minimum, maximum and percentile distribution of numeric variables (mean, stddev, 
min, pi, p5, p25, median, p75, p95, p99, max) 

Six-most frequently occurring and five-least frequently occurring values for character variables 



Syntax: 


LIBNAME catalog "<path of the catalogued'; 
OPTIONS mstored sasmstore = catalog; 

%EDD ( 


INLIB 

INPUTDATA 

EDD_OUT_LOC_XLS 

OUTLIB 

OUTDATA 

NUM_UNIQ 


= <Location of the input dataset>, 

= <Name of input dataset>, 

= <Name and location of the output XLS filed 
= CLocation of the output dataset>, 

= <Name of output dataset>, 

= <Option>* 


); 

*NUM_UNIQ can either be Y or N depending on whether the # of unique values column is desired. 
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1.5 Bivariate Analysis 



Unlike Univariate Analysis that involves standalone analysis of a variable (independent variable) distribution, the 
Bivariate profiling is a simultaneous analysis of two variables (a dependent and an independent variable) 

Bivariate profiling involves 

Dividing independent variable into different categories (or bins) 

Analyzing trend in sizing (percentage of records) across categories (or bins) 

Analyzing trend in mean value of dependent variable across categories (or bins) 

Illustration: 

Dependent Variable : Number of Days Spent by Patient in Hospital in a Year 
Independent Variable : Specialty of the Doctor 



On average, patients of Pediatrics spend very 
less number of days in hospital 

On the other hand, Diagnostic Imaging and 
Emergency cases spend longer time 

Majority of cases pertain to Internal, 
Laboratory and General Practice type 


Xexl 


| June 30, 2015 | © 2015 ExIService Holdings, Inc. 




















Chapter 2: Data Prep: Outlier Treatment 
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2. Outlier Treatment 


An outlier is a single observation “far away” from rest of the data. 

REASONS FOR OUTLIERS: 

■ Errors 

Data errors 
Sampling error 
Standardization failure 
Faulty distributional assumptions 
Human Error 

■ Genuine Outliers 


WHY DO WE CARE ABOUT OUTLIERS? 

■ Outliers are BAD 

The presence of outliers can lead to inflated error rates and substantial distortions of results that can lead to wrong conclusions and 
inferences. 

■ Outliers are GOOD 

The outliers can provide useful information in the data, for example, a spike in spend behavior of some customers may prove to be the 
deciding factor in marketing response campaigns. So care should be taken while dealing with outliers. 

In short, outliers are important and hence should not be ignored. 



TECHNIQUES FOR OUTLIER DETECTION / TREATMENT: 

■ Capping and Flooring Technique 

■ Exponential Smoothing Technique 

■ Sigma Approach 

■ Robust Regression Technique 

■ Mahalanobis Distance Technique 
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2.1 Capping and Flooring Technique 



OUTLIER 

DETECTION/TREATMENT 

TECHNIQUES 


| A. Capping and Flooring Technique 

B. Exponential Smoothing Technique 

C. Sigma Approach 

D. Robust Regression Technique 

E. Mahalanobis Distance Technique 


Technique 

Description 

In this technique, the outliers are identified and treated based 
upon the values of P99 and PI . 

Outlier 

Identification 

Outlier is defined as the value falling out by ‘x’ times P99 or ‘y’ 
times PI . 

Note: x and y are the factors which can take value any integer value as 
required by the data distribution and as decided based on the application. 

Outlier 

Treatment 

■ Capping - All values falling higher than ‘x’ times P99, are 
capped at the value “x * P99'. 

■ Flooring - All values less than ‘y’ times PI , are floored at the 
value “y * PI”. 

Note: 

- Values are capped at x * P99 when P99 and PMax both are positive. 

- Values are floored at y * PI when PI and PMin both are negative. 

Advantages 

■ Easy to understand & implement 

■ Run time is less 

Disadvantages 

■ Distribution of the data is not taken into account while 
identifying the outliers 

■ Rank order is not maintained 

_ )su 


XL 

tookdeeper 
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2.2 Exponential Smoothing Technique 



OUTLIER 

DETECTION/TREATMENT 

TECHNIQUES 


A. Capping and Flooring Technique 


| B. Exponential Smoothing Technique 


C. Sigma Approach 


D. Robust Regression Technique 


E. Mahalanobis Distance Technique 


Technique 

Description 

In this technique, the curve between P95 to P99 is extrapolated 
beyond P99, to identify the values falling above the curve. The 
values falling outside the curve are outliers and are treated 
according to some functions depending upon the boundary 
conditions. 

Outlier 

Identification 

Curve between P95 and P99 is extrapolated beyond P99. The 
values between P99 and PMax which fall outside this curve are 
termed as outliers. 

Note: Similar approach is followed for values between PMin and PI. 

Outlier 

Treatment 

Based upon the boundary condition a specific function is used to 
treat the outlier and maintain the rank order. 

Advantages 

■ Rank order is maintained 

■ Distribution of data is taken into account while identifying the 
outliers 

Disadvantages 

■ Functions involved in treating the outliers are quite complex 

_ 


XL 
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2.3 Sigma Approach 



OUTLIER 

DETECTION/TREATMENT 

TECHNIQUES 


A. Capping and Flooring Technique 

B. Exponential Smoothing Technique 

■-1 

| C. Sigma Approach 

D. Robust Regression Technique 

E. Mahalanobis Distance Technique 


Technique 

Description 

In this technique, the outliers are identified and treated based 
upon the values of mean and standard deviation. 

Outlier 

Outlier is defined as the value falling out of mean V or ‘x’ 
times sigma (standard deviation) 

Identification 

Note: x is the factor which can take value any integer value as required by 
the data distribution and as decided specifically. 

Outlier 

■ Capping - All values falling higher than mean plus ‘x’ times 
sigma, are capped at the value “mean + x * sigma”. 

Treatment 

■ Flooring - All values less than mean minus ‘x’ times sigma, 
are floored at the value “mean -x * sigma”. 

Advantages 

■ Easy to understand & implement 


■ Rank order is not maintained 

Disadvantages 

■ This method works best only when variables follow a normal 
distribution 

--V, 
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2.4 Robust Regression Technique 



OUTLIER 

DETECTION/TREATMENT 

TECHNIQUES 


A. Capping and Flooring Technique 


B. Exponential Smoothing Technique 


C. Sigma Approach 


j D. Robust Regression Technique 

E. Mahalanobis Distance Technique 


Technique 

Description 

This technique involves running regression repeatedly to identify 
outliers by assigning weights to the observations. The weights 
are on the basis of the prediction error (residual) in different 
iterations. 


Note: Higher residual means lower weight. 

Outlier 

Identification 

■ Weights are assigned to each observation, based on the 
normalized residual value 

■ High breakdown value method is used — it is the measure of 
the contamination in the data that an estimation can withstand 
and still maintain its robustness. 

Outlier 

Treatment 

The outliers are ignored (deleted) while making the model. 

Advantages 

■ Effect of outliers on model performance is minimized 

Disadvantages 

■ Computation of robust estimates is resource intensive 

■ Ignoring outliers may result in loss of data/information 

- it. 
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2.5 Mahalanobis Distance Technique 



OUTLIER 

DETECTION/TREATMENT 

TECHNIQUES 


A. Capping and Flooring Technique 

B. Exponential Smoothing Technique 

C. Sigma Approach 

D. Robust Regression Technique 

I-1 

| E. Mahalanobis Distance Technique 


Technique 

Description 

In this technique, the outliers are identified by the magnitude of 
‘Mahalanobis ’or statistical distance from the origin. 

Weights are given to each observation as the inverse of 
‘Mahalanobis’ distance. 

Outlier 

Identification 

The observations with extremely low weights can be considered 

as outliers. 

Outlier 

Treatment 

Weighted regression is run to take into account the effect of 
weight given to each observation. 

Greater the ‘Mahalanobis’ distance, lesser is the weight of that 
observation & hence lesser the contribution of that observation 
in final model. 

Advantages 

■ Effect of outliers on model performance is minimized 

Disadvantages 

■ Small change in data distribution could lead to more than 
normal deterioration of the model performance 

■ Complexity in calculation of ‘Mahalanobis’ distance for 
weighted regression 

_JflL_ 


XL 

(ookdeDper 


| June 30, 2015 | © 2015 ExIService Holdings, Inc. 




























2.6 Summary 


E 



Comparison Summary 


Technique 

Description 

A. Capping and 
Flooring 
Technique 

B. Exponential 
Smoothing 
Technique 

C. Sigma Approach 

D. Robust 
Regression 
Technique 

E. Mahalanobis 
Distance 
Technique 

Outlier 

■ Multiples of P99 

■ Extrapolation of 

■ Multiple of standard 

■ Assigning weights 

■ Mahalanobis 

Identification 

and PI 

distribution 

deviation from mean 

to observations 

distance from origin 

Outlier 

Capping and 

Boundary 

Capping and flooring 

Outliers ignored in 

Outliers given lower 

Treatment 

flooring at 
multiples of P99 
and PI 

condition of 

exponential 

function 

at multiples of 
standard deviation 
from mean 

modeling 

weights 

Advantages 

■ Easy to understand 
and implement 

■ Run time is less 

■ Rank ordering 

■ Considers 
distribution 

■ Easy to understand 
and implement 

■ Effect of outliers 
minimized 

■ Effect of outliers 
minimized 

Disadvantages 

■ No Rank ordering 

■ Distribution 
independent 

■ Complex 
exponential 
function 

■ No Rank ordering 

■ Best for normal 
distribution 

■ Data / information 
loss 

■ Computationally 
complex 

■ Over-dependency 
on distribution 

■ Complex technique 


In general, it has been observed: 

s Sigma Approach comes out to be the best technique in case of Logistic Regression 
s Mahalanobis Distance Technique is best for Linear Regression 

Note: Logistic regression corresponds to binary dependent variable; 

Linear regression is run to model a continuous dependent variable. 
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Exercise 


Exercise 1. For the given 100 records, _DEPVAR_ is the dependent variable and XI is a predictor. Analyze 
distribution of XI, identify outliers, treat them using all 5 techniques and compare the results. 


[Hint For using SAS macros, refer Appendix A.1 - A.5] 
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10 

203 

72 

171 

1 

293 

97 

196 

0 

23 

73 

172 

5 

656 

98 

197 

8 

65 

74 

173 

9 

81 

99 

198 

8 

647 

75 

174 

1 

180 

100 

199 

10 

489 

76 

175 

2 

182 

101 

200 

5 

441 
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Chapter 3: Data Prep: Missing Value Imputation 
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3. Missing Value Imputation 



MVI is a process of replacing missing values of a variable with the best possible estimates. 


Missing Identification 


Missing structure has to be identified for all variables. 


Variable Type 

Missing Value 

Numeric 


Character 

<blank> 


Along with above cases, both numeric and character variables can take invalid values which 
should be converted to missing, e.g. 


Impact of Missing Data 


Solution 


Variable 

Invalid Value 

Income 

9999999 

City 

XX 


Most multivariate analysis techniques, especially regression, drop all observations with missing 
values. 

Missing Value Imputation - replace missing values with estimates. 


Word of Caution 


An incorrect imputation can result in an incorrect estimation / prediction. 


Xexl 
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There are a variety of techniques for missing value imputation; but these should be considered 
more as scenario-specific than just being a set of pure alternative choices. 

Missing Value Imputation Techniques 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non-Missing Predictors 

J. DNI 

K. Multiple Imputation 
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3.1 Impute Missing Values with ZERO 



MVI TECHNIQUES 

i-1 

I A. Impute Missing Values with ZERO | 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 


Technique 

Description 

Execution 


Example 


Application 

& 

Evaluation 

Levers 


Impute missing values with ZERO. 


■ Run EDD 

■ Identify numeric variables with missing values 

■ Check if there are same number of missing values for a particular 
type of variables (whose source dataset is a subset of other 
variable source datasets) 

■ If it makes sense, impute missing values with 0 


Consider a case where 

■ Modeling dataset has 1 million records 

■ Debt history is populated for those in debt (400K records) 


Variable 

#Obs. 

#Missing 

Data Source 

age 

1,000,000 

0 

Demographics 

ind_female 

1,000,000 

0 

Demographics 

ind_payment_due 

1,000,000 

600,000 

Debt History 

due_amt 

1,000,000 

600,000 

Debt History 


It makes sense to impute 600K missing records with ZERO. 


Applicable for variables 
whose missing values should 
actually have been taking 

1 

Production Ready 

High 

1 

Easy to Understand 

High 

1 

Business Implication 

High 

ZERO value anyway. 

1 

Approximation 

High 


1 

Time Involved 

Low 


1 

Coverage 

Hi g h V 
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3.2 Impute Missing Values with MEDIAN 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

I-1 

| B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 


Technique 

Description 


Execution 


Application 

& 

Evaluation 

Levers 


Impute missing values with MEDIAN. 


Run EDD 

Identify numeric continuous variables with missing values 

Check variable distribution 

Identify variables with highly skewed distribution 

If it makes sense, impute missing values with median value 



Consider a variable with following distribution: 

Variable XYZ 


Variable distribution is highly 


N 

800 

NMISS 

15 

skewed. Extremely large 

1 MEAN 

355 

MEDIAN 

• — — — j 

51 1 

values for ~5% data are 

L_ 




distorting the mean value, 

MIN 

0 

MAX 

10,000 

thereby making it significantly 

PI 

2 

P99 

5,000 

different from the median 
value. Here, it makes sense to 

P5 

5 

P95 

4,750 

impute 15 missing records with 

P10 

9 

P90 

88 

MEDIAN. 

P25 

26 

P75 

75 



Applicable for continuous 
variables whose distribution 
is highly skewed. 


Production Ready 

High 

Easy to Understand 

High 

Business Implication 

Medium 

Approximation 

Medium 

Time Involved 

Low 

Coverage 

High 
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3.3 Impute Missing Values with MEAN 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

I-1 

I C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 
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3.4 Impute Missing Values with MODE 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

I-1 

I D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 
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3.5 Information based Segmentation 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

I-1 

| E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 


Technique 

Description 


Execution 


Application 

& 

Evaluation 

Levers 


Use a classifier to segment population with high missing rate on a 
set of variables. 


Identify classifier by business logic or by iterative analysis 
Segment population into sub populations based on classifier 
Resulting sub populations should have high fill rate 
Impute with mean/median/zero (as applicable) 



Low Fill Rate 


Individual Liable 

100 Accounts 


1 


Var. N NMISS 

Fill Rate 

A 100 88 

12.0% 

All Accounts 

B 100 100 

0.0% 

150 Accounts 

C 100 2 

98.0% 

Var. N NMISS 

Fill Rate 


A 150 90 

40.0% 

Corporate Liable 

B 150 105 

30.0% 

50 Accounts 

C 150 50 

66.7% 

Var. N NMISS 

Fill Rate 

1 


C\J 

O 

LO 

< 

96.0% 


_► B 50 5 

90.0% 



C 50 48 

4.0% 


There is a case for segment level modeling. Instead of dropping all three, C can be 
used in ‘Individual Liable’ segment model, while A and B can be used in the other 
one. 


Applicable where correct 
imputation is highly 
dependent on value of a 
classifier. 


Production Ready 
Easy to Understand 
Business Implication 
Approximation 
Time Involved 
Coverage 


High 
High 
High 
Low 
High 

Hi g h V 
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3.7 Non-Missing Dummy Creation 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

I G. Imputation and Non-Missing Dummy) 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 


Technique 

Description 


Execution 


Application 

& 

Evaluation 

Levers 


Create a non-missing dummy and also retain original variable. 


Run EDD 

Run bivariate graphs to identify variables with high correlation with 

dependent variable 

Create non-missing dummies 

Impute missing with Mean/Median (as applicable) 

Use both imputed value & non-missing dummy as predictors 



Consider the following case: 


HH SIZE 



HH_ 

SIZE 


1_ N 

100 

NMISS 

70 1 

MEAN 

5.5 

MEDIAN 

5 

MIN 

1 

MAX 

100 

PI 

2 

P99 

11 

P5 

3 

P95 

10 

P10 

3 

P90 

8 

P25 

4 

P75 

6 


ll 


III 


* IQ l/iviing 


Non-Missing Indicator Creation: 
i_NMI_HH_SIZE = (HH_SIZE ne .); 
Missing Imputation by Median: 

If (HH_SIZE eq .) then HH_SIZE = 5; 


Applicable when missing rate 
is high and the variable has a 
quite high correlation with the 
dependent variable. 


Production Ready 

High 

Easy to Understand 

High 

Business Implication 

Medium 

Approximation 

Low 

Time Involved 

Medium 

Coverage 

Medium^ 
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3.8 Impute based on 


Graphs 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 


I-1 



I. Impute using Regression on other Non- 


Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 
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3.9 Impute using Regression 


MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I I. Impute using Regression on other Non-| 
Missing Predictors 

J. Impute using CART 

K. DNI 

L. Multiple Imputation 
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3.10 Impute using CART 



MVI TECHNIQUES 

A. Impute Missing Values with ZERO 

B. Impute Missing Values with MEDIAN 

C. Impute Missing Values with MEAN 

D. Impute Missing Values with MODE 

E. Information based Segmentation 

F. Non-Missing Dummy Creation 

G. Imputation and Non-Missing Dummy 
Creation 

H. Impute based on Bivariate Graphs 

I. Impute using Regression on other Non- 
Missing Predictors 

-1 

J. Impute using CART | 

K. DNI 

L. Multiple Imputation 


Technique 

Description 


Execution 


Application 

& 

Evaluation 

Levers 


Develop a CART classification/regression tree with non-missing 
records to predict the value of missing variable. 


Run EDD 

Identify variables with high fill rate (say, more than 80%) 

Build a CART model on the non missing population using selected 
variables to predict the missing variable 
Impute the missing variable using CART model 


Imputing Income: 



Income Missing 



Impute Income Using 
CART Model 


Income Present 
& Other 

Predictors Have 
High Fill Rate 



Build CART 
Regression Tree 
Using Income As 
Target Variable 


Applicable for variables that 
have reliable predictive 
relationship with other 
independent variables. 


Production Ready 
Easy to Understand 
Business Implication 
Approximation 
Time Involved 
Coverage 


Medium 

Medium 

Medium 

Medium 

High 

Medium^ 


V 
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DART 


Exercise 


Exercise 2. For the given 50 records, _DEPVAR_ is the dependent variable, XI and X2 are predictors. Impute 
missing values of X2 based on 

1. Distribution of X2 only 

2. Distribution of X2 and DEPVAR 



3. 

Distribution of X2, 

_DEPVAR_ 

and XI 







A 

B 

c 

D 


A 

B 

c 

D 

1 

ID 

_DEPVAR_ 

XI 

X2 

1 

ID 

_DEPVAR_ 

XI 

X2 

2 

101 

0 

287 

0 

27 

126 

0 

798 

191 

3 

102 

1 

596 


28 

127 

1 

590 


4 

103 

0 

885 

168 

29 

128 

1 

207 

105 

5 

104 

0 

109 

424 

30 

129 

0 

622 

403 

6 

105 

0 

671 

232 

31 

130 

0 

954 


7 

106 

0 

699 

306 

32 

131 

1 

802 

120 

8 

107 

0 

287 

402 

33 

132 

1 

652 

336 

9 

108 

0 

420 

271 

34 

133 

0 

960 

162 

10 

109 

1 

529 

498 

35 

134 

0 

384 

197 

11 

110 

1 

534 

495 

36 

135 

0 

721 

19 

12 

111 

0 

917 

126 

37 

136 

0 

648 

165 

13 

112 

0 

112 

249 

38 

137 

0 

824 

366 

14 

113 

0 

297 

394 

39 

138 

0 

944 

393 

15 

114 

1 

338 

423 

40 

139 

0 

823 

471 

16 

115 

1 

137 

444 

41 

140 

1 

807 


17 

116 

1 

664 


42 

141 

0 

440 

56 

18 

117 

0 

363 

398 

43 

142 

1 

446 

465 

19 

118 

0 

758 

332 

44 

143 

0 

358 

194 

20 

119 

0 

429 

208 

45 

144 

0 

847 

398 

21 

120 

1 

190 


46 

145 

0 

646 

381 

22 

121 

0 

435 

41 

47 

146 

1 

700 

165 

23 

122 

0 

727 

130 

48 

147 

0 

702 

302 

24 

123 

0 

761 

14 

49 

148 

0 

717 

500 

25 

124 

0 

142 

370 

50 

149 

0 

424 

481 

26 

125 

0 

242 

188 

51 

150 

1 

949 

434 
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Chapter 4: Post Outlier Treatment and Imputation 
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4.1 Identify Non Usable Variables 



Even at this early stage one can identify certain variables which can be deemed as ‘non-usable for modeling 
purpose’. This way we can reduce the dimension of the dataset. Some logics that can be applied are as follows: 

■ Variables with a single unique value throughout the dataset: By definition, such variables have zero explanatory power 
and hence are irrelevant for any analysis. These variables are usually flags like merge indicators. 

■ ID Variables Such variables may be needed in the dataset for observation tagging. However, they should NOT be used as 
predictors in the model. 

■ Variables with very low fill rates: 

Case I: Variable, in question, is defined over a specific segment only. This segment may be used to subset the modeling 
dataset for developing segment-specific models. In such a case, the same variable is usable for one segment; while non- 
usable for the other. 

Case II: Missing value may signify something; and may be associated with a meaningful value. 

Case III: Variable fill rate is less than even 50% but there is a strong business case for its inclusion. In this case, the 
appropriate technique of missing value imputation should be applied. 

Case IV: If none of the above cases holds, some minimum fill rate cut-off may be put for dataset dimension reduction. 
According to standard modeling conventions, any variable with fill rate lower than 50% is not included in the model. This 
cut-off for fill rate can be set higher or lower depending on how well populated is the data received. 


■ Variables which cannot be used because of implementation issues should be dropped. 


Certain variable like Gender, Ethnicity which cannot be used due to regulatory issues (depending upon the business 
problem in context) should also be dropped. 
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4.2 Reformat Variables 



Categorical and continuous variables are treated differently in most of the analysis like CART, 
Logistic Regression, Bivariate analysis (as continuous variables would require binning and banding 
whereas categorical won’t). Hence, it’s always advisable to separate out possible categorical 
variables from the continuous ones. 

Few points to remember 

■ Look at EDD to check variable format. However, it is possible that variable format is not correct in 
data itself. Variable format type column in EDD can’t help in such exceptions. 

■ Check number of unique values. Numerical variables taking only 10-15 unique value may be 
treated as categorical. It’s a subjective call, depending on the variable and its expected use in 
model. 

■ Apply business sense before treating variables as continuous / categorical 



A numeric variable should never be converted to a categorical variable if the values have 
ordered meaning, even if the number of its unique values is just 3 or 4. 


Xexl 
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4.3 Immediate Next Steps 



Some Pointers as Immediate Next Steps 

■ Redundant variables should be dropped. Such variables don’t add any extra information. To identify such 
variables, variable reduction techniques can be used like variable clustering. 

■ Few predictors may be highly correlated. In such a situation, coefficient estimates may change erratically in 
response to small changes in the data. This problem of ‘multicollinearity’ should be taken care of. 

■ In case of categorical dependent variable, event rate needs to be looked upon. If event rate is too low, it may 
create a problem in developing a robust model. The modeler may need to do oversampling. 


These concepts (as components of data analysis and modeling) would be covered in more detail in the next 


module. 
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hook deeper 




A.1. Macro Call: Capping and Flooring 


DART 


Capping and Flooring Macro (X times) Syntax 

% OUTL_TREATMENT_X_T IME S ( 

EDD_LOC_LIB 
EDD_LOC_DATASET - 
LIB_IN 
DATA_IN 
LIB_OUT 
DATA_OUT 

NO_TREAT_VARL1ST - 
UNIQUE_ID 
X_T IME S_CAP 
X_TIME S_FLOOR 

_ u _ 

This macro detects and treats the outlier values using P99 and PI for numeric variables. 


FLOORING 


CASES 





LOGIC 



ILLUSTRATION (Let X = 5) 

MIN 

PI 



OUTLIER 

TREATMENT 

MIN 

PI 

TREATMENT 

CASE 1 

< 0 

<0 

Any value 

< 

X* PI 

Floor at X * PI 

-200 

- 10 

Floor at -50 

CASE II 

< 0 

= 0 

Any value 

< 

-X 

Floor at - X 

-200 

0 

Floor at -5 

CASE III 

< 0 

>0 

Any value 

< 

PI -(X* PI) 

Floor at PI -(X* PI) 

-200 

10 

Floor at -40 

CASE IV 

>0 

>0 

Any value 

< 

PI /X 

Floor at PI / X 

1 

10 

Floor at 2 


<Library where EDD in form of SAS dataset is located>, 
<Name of EDD SAS Dataset>, 

<Library of input dataset>, 

<Name of input dataset>, 

<Library of output dataset(outlier treated dataset )>, 
<Name of output dataset i.e. Outlier treated dataset>, 
<VARLIST for no outlier treatment (separated by space )>, 
<Unique identifier of input dataset>, 

<Outlier factor on P99 side>, 

<Outlier factor on PI side> 


CAPPING 


CASES 



LOGIC 



ILLUSTRATION (Let X = 5) 

P99 

MAX 

OUTLIER 

TREATMENT 

P99 

MAX 

TREATMENT 

CASE 1 

>0 

>0 

Any value > X * P99 

Cap at X * P99 

10 

200 

Cap at 50 

CASE II 

= 0 

>0 

Any value > X 

Cap at X 

0 

200 

Cap at 5 

CASE III 

< 0 

>0 

Any value > P99 - (X * P99) 

Cap at P99 - (X * P99) 

- 10 

200 

Cap at 40 

CASE IV 

< 0 

<0 

Any value > P99 / X 

Cap at P99 / X 

- 10 

- 1 

Cap at -2 


AEXL 
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A.2. Macro Call: Exponential Smoothing 


DART 


Exponential Smoothing Syntax 

% OUTL_TREATMENT_EXP_SMOOTH 

( 

EDD_LOC_LIB 

EDD_LOC_DATASET 

LIB_IN 

DATA_IN 

LIB_OUT 

DATA_OUT 

NO_TREAT_VARL1ST 
UNIQUE_ID 
X_T IME S_CAP 
X_TIMES_FLOOR 

); 


<Library where EDD in form of SAS dataset is located>, 
<Name of EDD SAS Dataset>, 

<Library of input dataset>, 

<Name of input dataset>, 

cLibrary of output dataset(outlier treated dataset )>, 
<Name of output dataset i.e. Outlier treated dataset>, 
<VARLIST for no outlier treatment (separated by space )>, 
<Unique identifier of input dataset>, 

<Outlier factor on P99 side>, 

<Outlier factor on PI side> 


This macro detects and treats the outlier values using exponential smoothing technique. 


In this technique, the curve between P95 to P99 is extrapolated beyond P99, to identify the values falling above the curve. The values 
falling outside the curve are outliers and are treated according to some functions depending upon the boundary conditions. 

Advantages 

■ Rank order is maintained 

■ Distribution of data is taken into account while identifying the outliers 

■ Run time is less 

Disadvantages 

■ Functions involved in treating the outliers are quite complex 
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A.3. Macro Call: Sigma Approach 


DART 


Sigma Approach Macro Syntax 

% OUTL_TREATMENT_S IGMA 

( 

EDD_LOC_LIB 

EDD_LOC_DATASET 

LIB_IN 

DATA_IN 

LIB_OUT 

DATA_OUT 

NO_TREAT_VARL1ST 
UNIQUE_ID 
X_T IME S_CAP 
X_TIMES_FLOOR 

); 


<Library where EDD in form of SAS dataset is located>, 
<Name of EDD SAS Dataset>, 

<Library of input dataset>, 

<Name of input dataset>, 

<Library of output dataset(outlier treated dataset )>, 
<Name of output dataset i.e. Outlier treated dataset>, 
<VARLIST for no outlier treatment (separated by space )>, 
<Unique identifier of input dataset>, 

<Outlier factor on P99 side>, 

<Outlier factor on PI side> 


This macro detects and treats the outlier values using sigma approach. In this technique, the outliers are identified and treated based 
upon the values of mean and standard deviation (sigma). The macro uses simple boundary condition to check for the outliers. 

Capping: Any value greater than (mean + X_TIMES_CAP * sigma) is an outlier 

Imputed value = (mean + X_TIMES_CAP * sigma) 

Flooring: Any value less than (mean - X_TIMES_FLOOR * sigma) is an outlier 

Imputed value = (mean - X_TIMES_FLOOR * sigma) 


Advantages 

■ Easy to understand & implement 

■ Run time is less 

Disadvantages 

■ Rank order is not maintained 

■ This method works best only when variables follow a normal distribution 
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A.4. Macro Call: Robust Regression 


DART 


Robust Regression Macro Syntax 

%ROBUSTREG_OUTLIER 

( 

IN_DATA = <Library and name of input dataset>, 

0UT_DATA1 = <Library and name of output dataset with info, on identified outlier records>, 

0UT_DATA2 = <Library and name of output dataset after removing outlier records>, 

0UT_DATA3 = <Library and name of output dataset with outlier records>, 

VAR_LIST = <VARLIST for no outlier treatment (separated by space )>, 

DEP_VAR = <Name of dependent variable>, 

UNIQUE_ID = <Unique identifier of input dataset> 

); 


This macro detects and treats outliers by using ROBUSTREG procedure in SAS. 


Outlier Identification: 

■ This technique involves running regression repeatedly to identify outliers by assigning weights to the observations. The weights 
are on the basis of the prediction error (residual) in different iterations. Higher residual means lower weight. 

Outlier treatment: 

■ Observations with zero weight are marked as outliers and need to be removed from the data for any kind of analysis 

Advantages 

■ Effect of outliers on model performance is minimized 

Disadvantages 

■ Ignoring outliers may result in loss of data/information. 

■ Computation of robust estimates is resource intensive. It takes ~30 minutes for running on 600 variables but for 2000 variables it 
takes ~100hrs (4days) 

46 | June 30, 2015 | © 2015 ExIService Holdings, Inc. 


EXL 




A.5. Macro Call: Mahalanobis Distance 


DART 


Mahalanobis Distance Macro Syntax 


%OUTLIER_MD ( 

IN_DATA 

VAR 

UNIQUE_ID 

OUT_DATA 

); 


= <Library and name of input dataset>, 

= <VARLIST for no outlier treatment (separated by space )>, 

= <Unique identifier of input dataset>, 

= <Library and name of output dataset (dataset with weight for OUTLIERS)> 


This macro detects outliers using Mahalanobis distance approach. This macro creates a weight variable (OT_WT) which, when used in 
regression, reduces the effect of outliers. 

In this technique, the outliers are identified by the magnitude of “Mahalanobis” or statistical distance from the origin. To each observation, 
weight is given as the inverse of “Mahalanobis” distance. 


Outlier Identification 

■ The observation with extremely low weights can be considered to be outliers 

Outlier Treatment 

■ Weighted regression is run to take into account the effect of weight given to each observation. (Greater the “Mahalanobis” 
distance, lesser is the weight of that observation & hence lesser the contribution of that observation in final model.) 

Advantages 

■ Effect of outliers on model performance is minimized 

Disadvantages 

■ A minor change in data distribution would lead to more than normal deterioration of the model performance 

■ Complexity in calculation of Mahalanobis distance for weighted regression 


Xexl 
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A.6. Data Collection Process: Details 


Identify Data Needs Data Mapping 


Plan Data Request 


Send Data Request 


Quality Check 


■ Start with Business 
Question 

■ Determine data 
need for delivering 
desired outcome 

Illustration: 

Business Question: 

How to match the most 
profitable credit 
product with each new 
customer? 

Solution: 

Use Credit & Payment 
History and Financial 
Statement data to 
predict account 
performance for 
different products. 

Data Request: 

A representative 
sample of customers 
from each product with 
usage and payment 
data for sufficient no. 
of months along with 
their credit score 


Assess Available 
Population Coverage 

■ Data availability 
constraints; viz. archives 
time span 

■ Population sizing by key 
characteristics like credit 
history 

■ Discuss with client 
unexpected size 
limitations 

Assess Alternative Data 

Sources 

■ Choose between 
alternatives 

■ Ensure that link keys 
work between sources 
chosen 

Plan to Optimize Client 

Resource Use 

Minimize workload for client 
IT department; even if it 
makes more work at our end 
to link files, convert media, 
reformat etc. 


Become as familiar as 
possible with the data 
sources and their 
content 


Data Mapping has 

basically three major 
components: 

■ Interview clients 

■ Obtain & study data 
layouts 

■ Obtain & evaluate 
data samples 

Note: 

The results of each 
step may require us 
to repeat one or more 
previous steps. 


Be as specific as possible! 

■ Accurate file names 

■ Specify selection criteria with 
respect to actual field names 
and value formats (e.g. 
“Values of the field 
STATE_CD in the subset = 

(IN,Ml)” rather than “Records 
from Indiana and Michigan”) 

■ Specify required or acceptable 
file formats 

■ Give detailed randomization 
and/or stratification 

Note: 

In case of Account x 
Transaction level data, 
random sampling of records 
is not the same as random 
sampling of accounts. 

■ Prepare the driver file 


Always examine results 
before acceptance! 

For each data file received, 

■ Compare basic statistics (no. 
of records, no. of fields, range 
of values in each field) to 
expectations and resolve 
discrepancies 

■ Ensure that delimiters, file 
format and record format 
meet requirements 

■ Ensure that the data 
dictionary matches the file 
exactly 

■ Enter file into data inventory, 
recording basic information 
(file name, date received, file 
size, record length, SPOC) 

While merging files, 

■ Watch out for identical 
merge-key field name with 
different meanings in two files 

■ Beware of the consequences 
of merging two datasets with 
few identically named non¬ 
key fields 

■ Specify a distinct output file 
for sorting 
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Thanks 

For queries, contact Varun Aggarwal at Varun.Aqqarwal@exlservice.com 
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